# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>
Explore the data (EDA: use str and summary)
Code
str(df)
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
$ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
Code
summary(df)
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2                                 
Check which columns contain NA values
Code
colSums(is.na(df))
          species            island    bill_length_mm     bill_depth_mm 
                0                 0                 2                 2 
flipper_length_mm       body_mass_g               sex              year 
                2                 2                11                 0 
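As a follow-up to the NA count, here is a minimal sketch of two common next steps: listing the incomplete rows and then dropping them. It uses a tiny made-up data frame, `toy`, instead of the penguin data so that it runs with base R alone; the name `toy` and its values are hypothetical.

```r
# Hypothetical toy data standing in for the penguin tibble
toy <- data.frame(
  species        = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  bill_length_mm = c(39.1, NA, 46.1, 50.0),
  sex            = c("male", NA, "female", NA)
)

# Number of NA values per column (same check as colSums(is.na(df)) above)
na_per_column <- colSums(is.na(toy))

# Rows that contain at least one NA
incomplete <- toy[!complete.cases(toy), ]

# Copy of the data with those rows dropped
toy_complete <- na.omit(toy)

na_per_column
nrow(incomplete)
nrow(toy_complete)
```

Whether dropping rows is appropriate depends on the analysis; with the penguin data it would discard every row where `sex` is missing, even if the measurements are present.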
# A tibble: 7 × 3
color carat평균 price중간값
<ord> <dbl> <dbl>
1 J 1.16 4234
2 I 1.03 3730
3 H 0.91 3460
4 F 0.74 2344.
5 G 0.77 2242
6 D 0.66 1838
7 E 0.66 1739
5. Among the diamonds whose cut is Premium, what is the price of the diamond with the largest carat?
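One possible way to answer this question is sketched below. It assumes the `diamonds` dataset from ggplot2 and uses dplyr, as elsewhere in this tutorial; the name `premium_largest` is introduced here only for illustration. `with_ties = TRUE` keeps every diamond that shares the maximum carat, so more than one row may come back.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

premium_largest <- diamonds |>
  filter(cut == "Premium") |>                    # keep Premium cut only
  slice_max(carat, n = 1, with_ties = TRUE) |>   # row(s) with the largest carat
  select(cut, carat, color, clarity, price)

premium_largest
```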
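PassengerId: unique ID number assigned to each passenger
Survived: whether the passenger survived (1) or died (0)
Pclass: passenger class
Name: name
Sex: passenger's sex
Age: passenger's age
SibSp: number of siblings/spouses aboard
Parch: number of parents/children aboard
Ticket: ticket number
Fare: amount paid for the ticket
Cabin: cabin category
Embarked: port where the passenger boarded (C = Cherbourg, Q = Queenstown, S = Southampton)
R ships a simplified, built-in version of the Titanic data. Unlike the passenger-level columns listed above, it is an aggregated table: each row is one combination of Class, Sex, Age and Survived, with the number of people in Freq. Load it as shown below and store it in the variable titanic before starting.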
titanic <- as.data.frame(Titanic)
Code
library(dplyr)  # filter(), group_by() and summarise() are used below
titanic <- as.data.frame(Titanic)
head(titanic)
Class Sex Age Survived Freq
1 1st Male Child No 0
2 2nd Male Child No 0
3 3rd Male Child No 35
4 Crew Male Child No 0
5 1st Female Child No 0
6 2nd Female Child No 0
1. How many female children were on board in total?
Code
titanic |>
  filter(Sex == "Female" & Age == "Child") |>
  summarise(n = sum(Freq))
n
1 45
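2. How many adult women were among the Crew?
Code
titanic |>
  # Age == "Adult" makes the filter match the question; the crew had no children, so the total is unchanged
  filter(Sex == "Female" & Age == "Adult" & Class == "Crew") |>
  summarise(n = sum(Freq))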
n
1 23
3. Show the number of survivors by Sex and Age.
Code
titanic |>
  filter(Survived == "Yes") |>
  group_by(Sex, Age) |>
  summarise(생존자 = sum(Freq))  # 생존자 = survivors
# A tibble: 4 × 3
# Groups: Sex [2]
Sex Age 생존자
<fct> <fct> <dbl>
1 Male Child 29
2 Male Adult 338
3 Female Child 28
4 Female Adult 316
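The same survivor counts can be cross-checked without dplyr. The sketch below uses base R's `xtabs()`, where `Freq` on the left-hand side of the formula supplies the counts; the name `survivors` is introduced here only for illustration.

```r
titanic <- as.data.frame(Titanic)

# Survivors only, cross-tabulated by Sex and Age; Freq supplies the counts
survivors <- xtabs(Freq ~ Sex + Age, data = subset(titanic, Survived == "Yes"))
survivors

# Same table with row and column totals added
addmargins(survivors)
```

The wide layout puts Sex in rows and Age in columns, which can be easier to read than the long tibble when only two grouping variables are involved.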
# A tibble: 6 × 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
Code
library(dplyr)
library(ggplot2)
library(tidyquant)  # scale_fill_tq(), theme_tq()

mpg %>%
  filter(cyl %in% c(4, 6, 8)) %>%
  ggplot(aes(x = factor(cyl), y = hwy, fill = factor(cyl))) +
  # half density: the "cloud"
  ggdist::stat_halfeye(adjust = 0.5, justification = -.2, .width = 0, width = 0.4, point_colour = NA) +
  # individual observations as dots: the "rain"
  ggdist::stat_dots(side = "left", justification = 1.1, binwidth = .25) +
  scale_fill_tq() +
  theme_tq() +
  labs(title = "Raincloud Plot",
       subtitle = "showing the bi-modal distribution of 6 cylinder vehicle",
       x = "engine size",
       y = "highway fuel economy",
       fill = "cylinders") +
  # coord_flip() +
  geom_boxplot(width = .12, outlier.color = NA, alpha = 0.5)
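The table examples that follow use a data frame `df` with columns `name` and `size`, which this section does not define. A minimal sketch, assuming `df` was built from R's built-in `islands` vector (landmass areas in thousands of square miles); the construction and the name `top10` are assumptions, not the original setup code.

```r
# Assumption: df mirrors the built-in `islands` named vector
df <- data.frame(
  name = names(islands),
  size = as.numeric(islands),
  row.names = NULL
)

# Ten largest landmasses, largest first (base R equivalent of arrange(-size) + slice(1:10))
top10 <- head(df[order(-df$size), ], 10)
top10
```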
Code
library(gt)  # gt(), tab_header() and the other table helpers used below

df |>
  arrange(-size) |>
  slice(1:10) |>
  gt() |>
  tab_header(
    title = "Large Landmasses of the World",
    subtitle = "The top ten largest are presented"
  )
Large Landmasses of the World
The top ten largest are presented
name             size
Asia            16988
Africa          11506
North America    9390
South America    6795
Antarctica       5500
Europe           3745
Australia        2968
Greenland         840
New Guinea        306
Borneo            280
Styling the title (markdown syntax with md)
Code
df |>
  arrange(-size) |>
  slice(1:2) |>
  gt() |>
  tab_header(
    title = md("**Large Landmasses of the World**"),
    subtitle = md("The *top two* largest are presented")
  )
Large Landmasses of the World
The top two largest are presented
name             size
Asia            16988
Africa          11506
Adding a source note to the footer (tab_source_note)
Code
df |>
  arrange(-size) |>
  slice(1:10) |>
  gt() |>
  tab_header(
    title = md("**Large Landmasses of the World**"),
    subtitle = md("The *top ten* largest are presented")
  ) |>
  tab_source_note(
    source_note = "Source: The World Almanac and Book of Facts, 1975, page 406."
  ) |>
  tab_source_note(
    source_note = md("Reference: McNeil, D. R. (1977) *Interactive Data Analysis*. Wiley.")
  )
Large Landmasses of the World
The top ten largest are presented
name             size
Asia            16988
Africa          11506
North America    9390
South America    6795
Antarctica       5500
Europe           3745
Australia        2968
Greenland         840
New Guinea        306
Borneo            280
Source: The World Almanac and Book of Facts, 1975, page 406.
Reference: McNeil, D. R. (1977) Interactive Data Analysis. Wiley.
Adding footnotes (tab_footnote)
Code
df |>
  arrange(-size) |>
  slice(1:10) |>
  gt() |>
  tab_header(
    title = md("**Large Landmasses of the World**"),
    subtitle = md("The *top ten* largest are presented")
  ) |>
  tab_source_note(
    source_note = "Source: The World Almanac and Book of Facts, 1975, page 406."
  ) |>
  tab_source_note(
    source_note = md("Reference: McNeil, D. R. (1977) *Interactive Data Analysis*. Wiley.")
  ) |>
  # footnote on the name cells in rows 3 and 4
  tab_footnote(
    footnote = "The Americas.",
    locations = cells_body(columns = name, rows = 3:4)
  ) |>
  # footnotes on the size cells holding the maximum and minimum values
  tab_footnote(
    footnote = "The largest by area.",
    locations = cells_body(columns = size, rows = size == max(size))
  ) |>
  tab_footnote(
    footnote = "The lowest by area.",
    locations = cells_body(columns = size, rows = size == min(size))
  )
Large Landmasses of the World
The top ten largest are presented
name              size
Asia            ¹ 16988
Africa            11506
North America²     9390
South America²     6795
Antarctica         5500
Europe             3745
Australia          2968
Greenland           840
New Guinea          306
Borneo          ³   280
Source: The World Almanac and Book of Facts, 1975, page 406.
Reference: McNeil, D. R. (1977) Interactive Data Analysis. Wiley.
¹ The largest by area.
² The Americas.
³ The lowest by area.
Saving the table
Code
df |>
  arrange(-size) |>
  slice(1:10) |>
  gt() |>
  tab_header(
    title = md("**Large Landmasses of the World**"),
    subtitle = md("The *top ten* largest are presented")
  ) |>
  tab_source_note(
    source_note = "Source: The World Almanac and Book of Facts, 1975, page 406."
  ) |>
  tab_source_note(
    source_note = md("Reference: McNeil, D. R. (1977) *Interactive Data Analysis*. Wiley.")
  ) |>
  tab_footnote(
    footnote = "The Americas.",
    locations = cells_body(columns = name, rows = 3:4)
  ) |>
  tab_footnote(
    footnote = "The largest by area.",
    locations = cells_body(columns = size, rows = size == max(size))
  ) |>
  tab_footnote(
    footnote = "The lowest by area.",
    locations = cells_body(columns = size, rows = size == min(size))
  ) |>
  # gtsave(filename = "tab_1.html")  # alternative: save as HTML instead
  gtsave("tab_1.png", expand = 10)